We think it is safe to say that the coronavirus has affected many aspects of our lives. From stay-at-home orders to wearing masks wherever you go, life definitely isn't the same. One aspect that has been greatly affected is schooling. With learning moving online, many students and teachers are in a new environment, trying to figure things out as we go along. We have heard from some students that motivation is hard to come by when they are on the computer all day and that their grades are tanking, while others have said they prefer being online and that their grades are the best they have ever been. We want to figure out what is actually happening with students and how external factors such as state COVID relief, how hard the area was hit by the coronavirus, and the income of the area play into it.
Add Descriptions of all libraries used
Add any definitions(Will probably need them when we discuss relief and income)
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression
# Reading state/county COVID data from 03/01/2020 to present
# all_counties is supplemental information from The New York Times. It has less daily information than the state health departments
all_counties = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv',sep=',')
# State population estimates for 2019 from the US Census Bureau
populations = pd.read_csv('https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/national/totals/nst-est2019-popchg2010_2019.csv',sep=',')
populations = populations[['NAME','POPESTIMATE2019']]
# Drop the national and regional summary rows, keeping the states, DC, and Puerto Rico
populations = populations.drop([0,1,2,3,4])
populations = populations.rename(columns={'NAME': 'State'})
# State areas in square miles
areas = pd.read_csv('https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-areas.csv',sep=',')
# Match each area to its state by name. Pasting the two tables side by side by row position
# shifts every area after Pennsylvania, because the two files place Puerto Rico differently.
pop_density = populations.merge(areas, left_on='State', right_on='state', how='left')
pop_density = pop_density.drop(columns=['state']).reset_index(drop=True)
pop_density['Population Density'] = pop_density['POPESTIMATE2019']/pop_density['area (sq. mi)']
pop_density
| | State | POPESTIMATE2019 | area (sq. mi) | Population Density |
|---|---|---|---|---|
| 0 | Alabama | 4903185 | 52423 | 93.5312 |
| 1 | Alaska | 731545 | 656425 | 1.11444 |
| 2 | Arizona | 7278717 | 114006 | 63.845 |
| 3 | Arkansas | 3017804 | 53182 | 56.7448 |
| 4 | California | 39512223 | 163707 | 241.359 |
| 5 | Colorado | 5758736 | 104100 | 55.3193 |
| 6 | Connecticut | 3565287 | 5544 | 643.089 |
| 7 | Delaware | 973764 | 1954 | 498.344 |
| 8 | District of Columbia | 705749 | 68 | 10378.7 |
| 9 | Florida | 21477737 | 65758 | 326.618 |
| 10 | Georgia | 10617423 | 59441 | 178.621 |
| 11 | Hawaii | 1415872 | 10932 | 129.516 |
| 12 | Idaho | 1787065 | 83574 | 21.383 |
| 13 | Illinois | 12671821 | 57918 | 218.789 |
| 14 | Indiana | 6732219 | 36420 | 184.85 |
| 15 | Iowa | 3155070 | 56276 | 56.0642 |
| 16 | Kansas | 2913314 | 82282 | 35.4065 |
| 17 | Kentucky | 4467673 | 40411 | 110.556 |
| 18 | Louisiana | 4648794 | 51843 | 89.6706 |
| 19 | Maine | 1344212 | 35387 | 37.986 |
| 20 | Maryland | 6045680 | 12407 | 487.28 |
| 21 | Massachusetts | 6892503 | 10555 | 653.008 |
| 22 | Michigan | 9986857 | 96810 | 103.159 |
| 23 | Minnesota | 5639632 | 86943 | 64.8659 |
| 24 | Mississippi | 2976149 | 48434 | 61.4475 |
| 25 | Missouri | 6137428 | 69709 | 88.0436 |
| 26 | Montana | 1068778 | 147046 | 7.26832 |
| 27 | Nebraska | 1934408 | 77358 | 25.0059 |
| 28 | Nevada | 3080156 | 110567 | 27.8578 |
| 29 | New Hampshire | 1359711 | 9351 | 145.408 |
| 30 | New Jersey | 8882190 | 8722 | 1018.37 |
| 31 | New Mexico | 2096829 | 121593 | 17.2447 |
| 32 | New York | 19453561 | 54475 | 357.11 |
| 33 | North Carolina | 10488084 | 53821 | 194.87 |
| 34 | North Dakota | 762062 | 70704 | 10.7782 |
| 35 | Ohio | 11689100 | 44828 | 260.754 |
| 36 | Oklahoma | 3956971 | 69903 | 56.6066 |
| 37 | Oregon | 4217737 | 98386 | 42.8693 |
| 38 | Pennsylvania | 12801989 | 46058 | 277.954 |
| 39 | Rhode Island | 1059361 | 1545 | 685.671 |
| 40 | South Carolina | 5148714 | 32007 | 160.862 |
| 41 | South Dakota | 884659 | 77121 | 11.4711 |
| 42 | Tennessee | 6829174 | 42146 | 162.036 |
| 43 | Texas | 28995881 | 268601 | 107.952 |
| 44 | Utah | 3205958 | 84904 | 37.7598 |
| 45 | Vermont | 623989 | 9615 | 64.8975 |
| 46 | Virginia | 8535519 | 42769 | 199.573 |
| 47 | Washington | 7614893 | 71303 | 106.796 |
| 48 | West Virginia | 1792147 | 24231 | 73.9609 |
| 49 | Wisconsin | 5822434 | 65503 | 88.8881 |
| 50 | Wyoming | 578759 | 97818 | 5.91669 |
| 51 | Puerto Rico | 3193694 | 3515 | 908.59 |
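When two tables are keyed by state name, joining on that name is safer than lining the rows up by position, since the source files may not list states in the same order. The following is a minimal sketch of the idea, not the project's exact pipeline: the two small frames are illustrative stand-ins built from three rows of the table above, and the second one is shuffled on purpose.

```python
import pandas as pd

# Illustrative stand-ins for the population and area tables (values taken
# from the first three rows above); the real data comes from the CSV files.
populations = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona"],
    "POPESTIMATE2019": [4903185, 731545, 7278717],
})
areas = pd.DataFrame({
    "state": ["Arizona", "Alabama", "Alaska"],  # deliberately in a different order
    "area (sq. mi)": [114006, 52423, 656425],
})

# Join on the state name so every population meets its own area;
# validate= raises an error if a name appears more than once on either side.
merged = populations.merge(
    areas, left_on="State", right_on="state", how="left", validate="one_to_one"
).drop(columns="state")

# Vectorized division replaces a row-by-row loop.
merged["Population Density"] = merged["POPESTIMATE2019"] / merged["area (sq. mi)"]
print(merged)
```

With `how="left"`, a state missing from the area table would show up as `NaN` instead of silently borrowing a neighbor's value.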
# Reformat the county-level data to make graphing easier: total cases and deaths per state per day
cases_state_day = all_counties.groupby(['date','state'])[['cases','deaths']].sum()
cases_state_day.reset_index(inplace=True)
# Split the states/territories into 6 groups so that each graph stays readable
state_list = sorted(set(cases_state_day['state']))
state_list = np.array_split(state_list, 6)
# List of dataframes. Each dataframe contains the number of COVID cases and deaths
# for each day for one group of US states/territories
df_list = []
for states in state_list:
    df_list.append(cases_state_day[cases_state_day['state'].isin(states)])
Predictions:
The information used in these graphs is from https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv, a database with the COVID cases and deaths per date for each county in the United States. The database is from a New York Times article.
These graphs show a timeline of COVID-19 cases and deaths for all US states and territories. I hypothesize that in the early months of the pandemic most states had a few hundred cases, and some had around 1,000-5,000 cases. At the beginning of the pandemic, there was a lot of hysteria and there were many unknown factors. This could have caused people to follow the stay-at-home orders more closely, so the number of cases should be at its lowest. I predict a spike in COVID cases at the beginning of June 2020. During that time, many states were prematurely loosening COVID restrictions. Months later, many states reinstated the restrictions.
In the later months, particularly around the holiday season, I expect the number of COVID cases to spike. Despite CDC warnings, many Americans were still traveling, which would lead to more people spreading the virus. People were also tired of following the stay-at-home regulations and in general became more complacent about the pandemic, which could have led to an increase in cases. Testing also became more common and more efficient. COVID tests in spring 2020 could take up to 14 days to give results, and now tests can give results within the same day. Faster results lead to more people getting tested, and more people getting tested leads to more positive test results. The graphs may show a drop in COVID cases and deaths after March 2021. In January 2021, vaccines became available for at-risk groups and essential workers, and by March COVID-19 vaccines were made available to everyone. So, I predict that after March 2021 the number of new cases per day will decrease or flatline because more people are vaccinated.
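The case and death columns in the dataset are running totals, so a prediction about new cases per day is easiest to check after differencing the totals within each state. The sketch below shows one way to do that; the two states and their counts are made up for illustration, and the column name `new_cases` is an assumption rather than something the project defines.

```python
import pandas as pd

# Made-up cumulative counts for two hypothetical states "A" and "B".
cumulative = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-02", "2020-03-03"] * 2,
    "state": ["A"] * 3 + ["B"] * 3,
    "cases": [1, 4, 9, 0, 2, 2],
})
cumulative["date"] = pd.to_datetime(cumulative["date"])
cumulative = cumulative.sort_values(["state", "date"])

# Difference the running total within each state. The first day of each
# state has no previous day, so it keeps its cumulative value.
cumulative["new_cases"] = (
    cumulative.groupby("state")["cases"].diff().fillna(cumulative["cases"])
)
print(cumulative)
```

A flat stretch in the cumulative curve corresponds to zero new cases, which is how a "flatline" would appear after differencing.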
# Graphs for COVID cases over time, one figure per group of states
for df in df_list:
    plt.figure(figsize=(16,9))
    ax = sns.lineplot(data=df.sort_values('date'), x="date", y="cases", hue="state", linewidth=3)
    plt.xticks(rotation='vertical')
    # Show only every 17th date label so the x-axis stays readable
    xticks = ax.xaxis.get_major_ticks()
    for i in range(len(xticks)):
        if i % 17 != 0:
            xticks[i].set_visible(False)
    plt.ylabel('Cases x 1,000,000')
    plt.title('Cases over Time')
    plt.show()
RESULTS FOR CASES OVER TIME:
The six Cases over Time graphs show a timeline of the number of reported COVID-19 cases for each state since the beginning of the pandemic.
In the Cases over Time graphs, most of the lines followed our predictions. We predicted that the number of cases would be lowest at the beginning of the pandemic, with each state reporting 0 to 5,000 COVID cases. Many factors could have kept the early case numbers low, from inefficient testing to Americans taking the stay-at-home regulations and social distancing more seriously. Most states stay under 10,000 COVID cases until June 2020. The outliers that do not follow this early trend are New York, with about 150,000 cases by the end of March 2020 and around 400,000 by May 2020, and New Jersey, with approximately 150,000 cases by April 2020. These states could genuinely have had more COVID cases than the other states, or they could have had better testing facilities. Another factor is that New York and New Jersey have big cities that many people commute to and from, which could have led to more cases.
Surprising information from the graphs is that Alaska, Guam, the Virgin Islands, Hawaii, and the Northern Mariana Islands each have fewer than 100,000 COVID cases. All of these locations are separate from the mainland United States, so it is likely that at the beginning of the pandemic these regions cut off travel to prevent the spread of the virus. Another explanation could be that several of these regions require tourists and anyone else leaving or entering to take COVID-19 tests and to quarantine while they await the results. This would likely deter tourism and help prevent the virus from entering the region at all.
Our prediction that COVID cases would spike in June 2020 was correct. Many states report more COVID cases starting in late May or June. A likely explanation is that state governments across the country lifted many COVID restrictions in June 2020. In July 2020, many states then reinstated the restrictions because COVID was still prevalent.
Finally, many states had a major increase in COVID cases around December 2020. We predicted that this would happen because of the number of Americans traveling around the holiday season. Almost all of the states had a major increase in COVID-19 cases from November 2020 to January 2021, with some of the most notable increases in California, New York, Florida, and Texas. Each of those states surpassed 1,000,000 cases by the beginning of 2021, with Texas and California surpassing 1,500,000 cases.
PREDICTIONS FOR COVID DEATHS OVER TIME:
I expect the COVID-19 deaths to closely follow the graphs of coronavirus cases. Similarly, I expect the states with the most COVID cases to have the most COVID-related deaths. The states with the most COVID cases naturally have the potential to have the most COVID deaths, and I would be very surprised if a state with low COVID case numbers had a high COVID death toll.
Much like in the Cases over Time graphs, I expect each state to have under 5,000 total COVID deaths up to May 2020. It is also likely that the states with high numbers of COVID cases in March and April also have many COVID-related deaths. I also expect a spike in COVID deaths around June 2020 because of states lifting COVID restrictions, and spikes from November 2020 to January 2021 because of the holiday season.
# Graphs for COVID deaths over time, one figure per group of states
for df in df_list:
    plt.figure(figsize=(16,9))
    ax = sns.lineplot(data=df.sort_values('date'), x="date", y="deaths", hue="state", linewidth=3)
    plt.xticks(rotation='vertical')
    # Show only every 17th date label so the x-axis stays readable
    xticks = ax.xaxis.get_major_ticks()
    for i in range(len(xticks)):
        if i % 17 != 0:
            xticks[i].set_visible(False)
    plt.ylabel('Deaths')
    plt.title('Deaths over Time')
    plt.show()
Many of the states' first COVID deaths occurred around the end of March 2020 and the beginning of April 2020. Unsurprisingly, by the end of May 2020 almost 30,000 people in New York had died because of COVID. By the end of May, Pennsylvania, Massachusetts, and Illinois had all surpassed 5,000 COVID deaths. In New Jersey, there had been over 15,000 COVID deaths by June 25th. In May 2020 California had over 5,000 deaths, and surprisingly Florida and Texas had under 5,000 deaths each until August 2020. This is surprising because in the Cases over Time graphs those two states had some of the highest numbers of cases.
Regions like Hawaii, Alaska, Guam, the US Virgin Islands, and the Northern Mariana Islands all had fewer than 1,000 COVID deaths. This is unsurprising because those regions also had relatively few cases.
In November 2020 most of the states' COVID death tolls began to spike. A likely reason for the increase is the holiday season and Americans traveling. Finally, the states with the most COVID deaths are Pennsylvania with over 25,000 deaths, New Jersey with over 25,000 deaths, Florida with over 35,000 deaths, Texas with over 50,000 deaths, New York with over 50,000 deaths, and California with over 60,000 deaths. These states have large metropolitan areas and large populations, so it is no surprise that the most COVID deaths happened in these states. This list is also similar to the list of states with the largest number of COVID cases. Later in the project, we determine whether there is a correlation between a state's population and its reported COVID cases and deaths.
url1 = 'https://drive.google.com/file/d/1Ou29O44sa_iOTEXPEKYi8Y6-QZq_a8AB/view?usp=sharing'
url2='https://drive.google.com/uc?id=' + url1.split('/')[-2]
incomes = pd.read_excel(url2,header=1)
incomes = incomes.drop([0,1,2,3,4,111,112])
incomes = incomes.rename(columns={'Table H-8. Median Household Income by State: 1984 to 2019': 'State'})
incomes = incomes.rename(columns={'Unnamed: 1': 'Median Income'})
incomes = pd.concat([incomes[['State']],incomes[['Median Income']]],axis=1)
incomes = incomes.drop_duplicates()
incomes = incomes.drop([56,57,58,59])
incomes = incomes.sort_values(by=['State'])
incomes = incomes.reset_index(drop=True)
incomes
plt.figure(figsize=(25,15))
plt.bar(incomes['State'],incomes['Median Income'])
plt.xticks(rotation='vertical')
plt.xticks(size = 20)
plt.yticks(size = 20)
plt.xlabel('State', size = 20)
plt.ylabel('Median Income in Dollars', size = 20)
plt.title('Median Income per State', size = 30)
plt.show()
Using data from the Pew Research Center, we went through all of the wealth distribution data and made an Excel sheet with the distribution percentages by state. We then put the Excel sheet in Google Drive and loaded it into the Colab notebook. Below is the table of the state distributions and the national distribution. To visualize the data better, we created pie charts.
url1 ='https://drive.google.com/file/d/1SqqIEKXdUVKJsQmNxOn4K9agTvv1RHBP/view?usp=sharing'
url2='https://drive.google.com/uc?id=' + url1.split('/')[-2]
wealth_distributions = pd.read_excel(url2)
pie_chart = wealth_distributions.copy()
pie_chart = pie_chart.set_index('State')
ax = pie_chart.loc['District of Columbia'].plot.pie(title="District of Columbia Wealth Distribution",autopct='%1.1f%%',figsize=(5, 5))
ax.set_ylabel(' ')
plt.show()
ax = pie_chart.loc['Maryland'].plot.pie(title="Maryland Wealth Distribution",autopct='%1.1f%%',figsize=(5, 5))
ax.set_ylabel(' ')
plt.show()
Pie Charts Showing Wealth Distribution:
According to the “Median Income per State” bar chart, Maryland has the highest median income, so it is not surprising that about 53% of Marylanders are middle class. We hypothesize that Maryland will have low COVID-19 case numbers because of its high median income. The District of Columbia has 26% of its residents in the lower class, while Maryland has about 21%. We hypothesize that Maryland will have fewer COVID cases than DC because of the difference in median income.
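The percentages printed on a pie chart can differ slightly from the table values, because a pie chart rescales the slices so that they sum to 100%. Maryland's three rounded shares (27, 53, and 21) add up to 101, so each label shrinks a little. The short sketch below reproduces that arithmetic in plain Python; the dictionary is an illustrative stand-in for one row of the wealth distribution data.

```python
# Maryland's rounded class shares from the wealth distribution data.
shares = {"Upper Class %": 27.0, "Middle Class %": 53.0, "Lower Class %": 21.0}

# The shares were rounded separately, so they do not add up to exactly 100.
total = sum(shares.values())

# A pie chart labels each slice with its fraction of the total, not the raw value.
pie_labels = {name: round(100 * value / total, 1) for name, value in shares.items()}
print(total, pie_labels)
```

This is why the Maryland chart shows roughly 52.5% and 20.8% where the table lists 53% and 21%.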
density = pop_density.copy()
income_weath_distribution_popDensity = pd.concat([incomes.set_index('State'),wealth_distributions.set_index('State'),density.set_index('State')],axis=1)
income_weath_distribution_popDensity
| State | Median Income | Upper Class % | Middle Class % | Lower Class % | POPESTIMATE2019 | area (sq. mi) | Population Density |
|---|---|---|---|---|---|---|---|
| Alabama | 56200 | 17.0 | 52.0 | 31.0 | 4903185 | 52423 | 93.5312 |
| Alaska | 78394 | 19.0 | 58.0 | 23.0 | 731545 | 656425 | 1.11444 |
| Arizona | 70674 | 17.0 | 54.0 | 30.0 | 7278717 | 114006 | 63.845 |
| Arkansas | 54539 | 15.0 | 52.0 | 32.0 | 3017804 | 53182 | 56.7448 |
| California | 78105 | 19.0 | 49.0 | 31.0 | 39512223 | 163707 | 241.359 |
| Colorado | 72499 | 22.0 | 55.0 | 23.0 | 5758736 | 104100 | 55.3193 |
| Connecticut | 87291 | 27.0 | 50.0 | 24.0 | 3565287 | 5544 | 643.089 |
| Delaware | 74194 | 19.0 | 56.0 | 25.0 | 973764 | 1954 | 498.344 |
| District of Columbia | 93111 | 36.0 | 38.0 | 26.0 | 705749 | 68 | 10378.7 |
| Florida | 58368 | 15.0 | 53.0 | 29.0 | 21477737 | 65758 | 326.618 |
| Georgia | 56628 | 20.0 | 51.0 | 29.0 | 10617423 | 59441 | 178.621 |
| Hawaii | 88006 | 18.0 | 58.0 | 24.0 | 1415872 | 10932 | 129.516 |
| Idaho | 65988 | 15.0 | 56.0 | 29.0 | 1787065 | 83574 | 21.383 |
| Illinois | 74399 | 22.0 | 52.0 | 26.0 | 12671821 | 57918 | 218.789 |
| Indiana | 66693 | 18.0 | 57.0 | 26.0 | 6732219 | 36420 | 184.85 |
| Iowa | 66054 | 19.0 | 58.0 | 22.0 | 3155070 | 56276 | 56.0642 |
| Kansas | 73151 | 20.0 | 55.0 | 25.0 | 2913314 | 82282 | 35.4065 |
| Kentucky | 55662 | 16.0 | 53.0 | 31.0 | 4467673 | 40411 | 110.556 |
| Louisiana | 51707 | 18.0 | 48.0 | 34.0 | 4648794 | 51843 | 89.6706 |
| Maine | 66546 | 15.0 | 56.0 | 29.0 | 1344212 | 35387 | 37.986 |
| Maryland | 95572 | 27.0 | 53.0 | 21.0 | 6045680 | 12407 | 487.28 |
| Massachusetts | 87707 | 26.0 | 51.0 | 23.0 | 6892503 | 10555 | 653.008 |
| Michigan | 64119 | 19.0 | 54.0 | 27.0 | 9986857 | 96810 | 103.159 |
| Minnesota | 81426 | 23.0 | 56.0 | 22.0 | 5639632 | 86943 | 64.8659 |
| Mississippi | 44787 | 14.0 | 51.0 | 36.0 | 2976149 | 48434 | 61.4475 |
| Missouri | 60597 | 19.0 | 55.0 | 26.0 | 6137428 | 69709 | 88.0436 |
| Montana | 60195 | 16.0 | 57.0 | 27.0 | 1068778 | 147046 | 7.26832 |
| Nebraska | 73071 | 19.0 | 58.0 | 24.0 | 1934408 | 77358 | 25.0059 |
| Nevada | 70906 | 17.0 | 56.0 | 28.0 | 3080156 | 110567 | 27.8578 |
| New Hampshire | 86900 | 23.0 | 57.0 | 21.0 | 1359711 | 9351 | 145.408 |
| New Jersey | 87726 | 24.0 | 51.0 | 24.0 | 8882190 | 8722 | 1018.37 |
| New Mexico | 53113 | 15.0 | 48.0 | 37.0 | 2096829 | 121593 | 17.2447 |
| New York | 71855 | 19.0 | 49.0 | 32.0 | 19453561 | 54475 | 357.11 |
| North Carolina | 61159 | 18.0 | 53.0 | 29.0 | 10488084 | 53821 | 194.87 |
| North Dakota | 70031 | 24.0 | 53.0 | 21.0 | 762062 | 70704 | 10.7782 |
| Ohio | 64663 | 20.0 | 54.0 | 25.0 | 11689100 | 44828 | 260.754 |
| Oklahoma | 59397 | 17.0 | 54.0 | 20.0 | 3956971 | 69903 | 56.6066 |
| Oregon | 74413 | 18.0 | 55.0 | 27.0 | 4217737 | 98386 | 42.8693 |
| Pennsylvania | 70582 | 19.0 | 54.0 | 27.0 | 12801989 | 46058 | 277.954 |
| Rhode Island | 70151 | 22.0 | 53.0 | 25.0 | 1059361 | 1545 | 685.671 |
| South Carolina | 62028 | 16.0 | 53.0 | 31.0 | 5148714 | 32007 | 160.862 |
| South Dakota | 64255 | 17.0 | 57.0 | 26.0 | 884659 | 77121 | 11.4711 |
| Tennessee | 56627 | 17.0 | 54.0 | 30.0 | 6829174 | 42146 | 162.036 |
| Texas | 67444 | 18.0 | 53.0 | 29.0 | 28995881 | 268601 | 107.952 |
| Utah | 84523 | 17.0 | 61.0 | 22.0 | 3205958 | 84904 | 37.7598 |
| Vermont | 74305 | 15.0 | 58.0 | 27.0 | 623989 | 9615 | 64.8975 |
| Virginia | 81313 | 25.0 | 51.0 | 24.0 | 8535519 | 42769 | 199.573 |
| Washington | 82454 | 22.0 | 54.0 | 24.0 | 7614893 | 71303 | 106.796 |
| West Virginia | 53706 | 14.0 | 52.0 | 34.0 | 1792147 | 24231 | 73.9609 |
| Wisconsin | 67355 | 20.0 | 58.0 | 23.0 | 5822434 | 65503 | 88.8881 |
| Wyoming | 65134 | 19.0 | 57.0 | 24.0 | 578759 | 97818 | 5.91669 |
| Puerto Rico | NaN | NaN | NaN | NaN | 3193694 | 3515 | 908.59 |
total_cases_and_deaths = all_counties.copy()
total_cases_and_deaths = total_cases_and_deaths.drop(columns=['date','fips'])
total_cases_and_deaths = total_cases_and_deaths.rename(columns={'state': 'State'})
total_cases_and_deaths = total_cases_and_deaths.set_index('State')
total_cases_and_deaths = total_cases_and_deaths.groupby(['State','county'])[['cases','deaths']].max()
total_cases_and_deaths = total_cases_and_deaths.groupby(['State']).sum()
total_cases_and_deaths = total_cases_and_deaths.drop(['Northern Mariana Islands','Virgin Islands','Guam'])
total_cases_and_deaths
| State | cases | deaths |
|---|---|---|
| Alabama | 539848 | 11045.0 |
| Alaska | 69042 | 338.0 |
| Arizona | 872542 | 17472.0 |
| Arkansas | 339255 | 5819.0 |
| California | 3768240 | 62677.0 |
| Colorado | 534281 | 6593.0 |
| Connecticut | 344977 | 8184.0 |
| Delaware | 107145 | 1652.0 |
| District of Columbia | 48530 | 1118.0 |
| Florida | 2289926 | 36113.0 |
| Georgia | 1091755 | 19894.0 |
| Hawaii | 34084 | 490.0 |
| Idaho | 190455 | 2074.0 |
| Illinois | 1380930 | 24772.0 |
| Indiana | 738066 | 13472.0 |
| Iowa | 372075 | 6007.0 |
| Kansas | 313867 | 6203.0 |
| Kentucky | 455234 | 6832.0 |
| Louisiana | 471420 | 10713.0 |
| Maine | 65550 | 803.0 |
| Maryland | 455862 | 9005.0 |
| Massachusetts | 701490 | 17842.0 |
| Michigan | 984245 | 19812.0 |
| Minnesota | 593932 | 7380.0 |
| Mississippi | 315182 | 7260.0 |
| Missouri | 610464 | 9401.0 |
| Montana | 110744 | 1603.0 |
| Nebraska | 222496 | 2423.0 |
| Nevada | 320808 | 5543.0 |
| New Hampshire | 98137 | 1336.0 |
| New Jersey | 1016172 | 26046.0 |
| New Mexico | 200863 | 4130.0 |
| New York | 2081761 | 53548.0 |
| North Carolina | 997897 | 12891.0 |
| North Dakota | 111406 | 1548.0 |
| Ohio | 1091578 | 19719.0 |
| Oklahoma | 454183 | 6884.0 |
| Oregon | 195269 | 2602.0 |
| Pennsylvania | 1189708 | 26805.0 |
| Puerto Rico | 170043 | 2431.0 |
| Rhode Island | 154869 | 2895.0 |
| South Carolina | 588110 | 9644.0 |
| South Dakota | 123707 | 1999.0 |
| Tennessee | 847027 | 12287.0 |
| Texas | 2933437 | 52075.0 |
| Utah | 403576 | 2260.0 |
| Vermont | 23851 | 268.0 |
| Virginia | 670082 | 11135.0 |
| Washington | 425322 | 5685.0 |
| West Virginia | 158247 | 2855.0 |
| Wisconsin | 674105 | 7733.0 |
| Wyoming | 59110 | 712.0 |
income_weath_distribution_popDensity_covid = pd.concat([income_weath_distribution_popDensity,total_cases_and_deaths],axis=1)
wealth_data = income_weath_distribution_popDensity_covid.copy()
income_weath_distribution_popDensity_covid = income_weath_distribution_popDensity_covid.drop(['Puerto Rico'])
# Swapping state names to abbreviations to make graphs cleaner
abbrev = pd.read_csv('https://raw.githubusercontent.com/jasonong/List-of-US-States/master/states.csv')
income_weath_distribution_popDensity_covid = pd.concat([income_weath_distribution_popDensity_covid,abbrev.set_index('State')],axis=1)
income_weath_distribution_popDensity_covid = income_weath_distribution_popDensity_covid.rename_axis('State')
# Cases and deaths per 1,000 residents, computed column-wise instead of row by row
income_weath_distribution_popDensity_covid['Cases per 1k Population'] = (income_weath_distribution_popDensity_covid['cases']/income_weath_distribution_popDensity_covid['POPESTIMATE2019']) * 1000
income_weath_distribution_popDensity_covid['Deaths per 1k Population'] = (income_weath_distribution_popDensity_covid['deaths']/income_weath_distribution_popDensity_covid['POPESTIMATE2019']) * 1000
income_weath_distribution_popDensity_covid
| State | Median Income | Upper Class % | Middle Class % | Lower Class % | POPESTIMATE2019 | area (sq. mi) | Population Density | cases | deaths | Abbreviation | Cases per 1k Population | Deaths per 1k Population |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alabama | 56200 | 17.0 | 52.0 | 31.0 | 4903185 | 52423 | 93.5312 | 539848 | 11045.0 | AL | 110.101 | 2.25262 |
| Alaska | 78394 | 19.0 | 58.0 | 23.0 | 731545 | 656425 | 1.11444 | 69042 | 338.0 | AK | 94.3783 | 0.462036 |
| Arizona | 70674 | 17.0 | 54.0 | 30.0 | 7278717 | 114006 | 63.845 | 872542 | 17472.0 | AZ | 119.876 | 2.40042 |
| Arkansas | 54539 | 15.0 | 52.0 | 32.0 | 3017804 | 53182 | 56.7448 | 339255 | 5819.0 | AR | 112.418 | 1.92822 |
| California | 78105 | 19.0 | 49.0 | 31.0 | 39512223 | 163707 | 241.359 | 3768240 | 62677.0 | CA | 95.369 | 1.58627 |
| Colorado | 72499 | 22.0 | 55.0 | 23.0 | 5758736 | 104100 | 55.3193 | 534281 | 6593.0 | CO | 92.7775 | 1.14487 |
| Connecticut | 87291 | 27.0 | 50.0 | 24.0 | 3565287 | 5544 | 643.089 | 344977 | 8184.0 | CT | 96.76 | 2.29547 |
| Delaware | 74194 | 19.0 | 56.0 | 25.0 | 973764 | 1954 | 498.344 | 107145 | 1652.0 | DE | 110.032 | 1.69651 |
| District of Columbia | 93111 | 36.0 | 38.0 | 26.0 | 705749 | 68 | 10378.7 | 48530 | 1118.0 | DC | 68.7638 | 1.58413 |
| Florida | 58368 | 15.0 | 53.0 | 29.0 | 21477737 | 65758 | 326.618 | 2289926 | 36113.0 | FL | 106.619 | 1.68142 |
| Georgia | 56628 | 20.0 | 51.0 | 29.0 | 10617423 | 59441 | 178.621 | 1091755 | 19894.0 | GA | 102.827 | 1.87371 |
| Hawaii | 88006 | 18.0 | 58.0 | 24.0 | 1415872 | 10932 | 129.516 | 34084 | 490.0 | HI | 24.0728 | 0.346076 |
| Idaho | 65988 | 15.0 | 56.0 | 29.0 | 1787065 | 83574 | 21.383 | 190455 | 2074.0 | ID | 106.574 | 1.16056 |
| Illinois | 74399 | 22.0 | 52.0 | 26.0 | 12671821 | 57918 | 218.789 | 1380930 | 24772.0 | IL | 108.976 | 1.95489 |
| Indiana | 66693 | 18.0 | 57.0 | 26.0 | 6732219 | 36420 | 184.85 | 738066 | 13472.0 | IN | 109.632 | 2.00112 |
| Iowa | 66054 | 19.0 | 58.0 | 22.0 | 3155070 | 56276 | 56.0642 | 372075 | 6007.0 | IA | 117.929 | 1.90392 |
| Kansas | 73151 | 20.0 | 55.0 | 25.0 | 2913314 | 82282 | 35.4065 | 313867 | 6203.0 | KS | 107.735 | 2.12919 |
| Kentucky | 55662 | 16.0 | 53.0 | 31.0 | 4467673 | 40411 | 110.556 | 455234 | 6832.0 | KY | 101.895 | 1.52921 |
| Louisiana | 51707 | 18.0 | 48.0 | 34.0 | 4648794 | 51843 | 89.6706 | 471420 | 10713.0 | LA | 101.407 | 2.30447 |
| Maine | 66546 | 15.0 | 56.0 | 29.0 | 1344212 | 35387 | 37.986 | 65550 | 803.0 | ME | 48.7646 | 0.597376 |
| Maryland | 95572 | 27.0 | 53.0 | 21.0 | 6045680 | 12407 | 487.28 | 455862 | 9005.0 | MD | 75.4029 | 1.48949 |
| Massachusetts | 87707 | 26.0 | 51.0 | 23.0 | 6892503 | 10555 | 653.008 | 701490 | 17842.0 | MA | 101.776 | 2.58861 |
| Michigan | 64119 | 19.0 | 54.0 | 27.0 | 9986857 | 96810 | 103.159 | 984245 | 19812.0 | MI | 98.554 | 1.98381 |
| Minnesota | 81426 | 23.0 | 56.0 | 22.0 | 5639632 | 86943 | 64.8659 | 593932 | 7380.0 | MN | 105.314 | 1.3086 |
| Mississippi | 44787 | 14.0 | 51.0 | 36.0 | 2976149 | 48434 | 61.4475 | 315182 | 7260.0 | MS | 105.903 | 2.43939 |
| Missouri | 60597 | 19.0 | 55.0 | 26.0 | 6137428 | 69709 | 88.0436 | 610464 | 9401.0 | MO | 99.4658 | 1.53175 |
| Montana | 60195 | 16.0 | 57.0 | 27.0 | 1068778 | 147046 | 7.26832 | 110744 | 1603.0 | MT | 103.617 | 1.49984 |
| Nebraska | 73071 | 19.0 | 58.0 | 24.0 | 1934408 | 77358 | 25.0059 | 222496 | 2423.0 | NE | 115.02 | 1.25258 |
| Nevada | 70906 | 17.0 | 56.0 | 28.0 | 3080156 | 110567 | 27.8578 | 320808 | 5543.0 | NV | 104.153 | 1.79958 |
| New Hampshire | 86900 | 23.0 | 57.0 | 21.0 | 1359711 | 9351 | 145.408 | 98137 | 1336.0 | NH | 72.1749 | 0.982562 |
| New Jersey | 87726 | 24.0 | 51.0 | 24.0 | 8882190 | 8722 | 1018.37 | 1016172 | 26046.0 | NJ | 114.406 | 2.93238 |
| New Mexico | 53113 | 15.0 | 48.0 | 37.0 | 2096829 | 121593 | 17.2447 | 200863 | 4130.0 | NM | 95.7937 | 1.96964 |
| New York | 71855 | 19.0 | 49.0 | 32.0 | 19453561 | 54475 | 357.11 | 2081761 | 53548.0 | NY | 107.012 | 2.75261 |
| North Carolina | 61159 | 18.0 | 53.0 | 29.0 | 10488084 | 53821 | 194.87 | 997897 | 12891.0 | NC | 95.1458 | 1.22911 |
| North Dakota | 70031 | 24.0 | 53.0 | 21.0 | 762062 | 70704 | 10.7782 | 111406 | 1548.0 | ND | 146.19 | 2.03133 |
| Ohio | 64663 | 20.0 | 54.0 | 25.0 | 11689100 | 44828 | 260.754 | 1091578 | 19719.0 | OH | 93.3843 | 1.68696 |
| Oklahoma | 59397 | 17.0 | 54.0 | 20.0 | 3956971 | 69903 | 56.6066 | 454183 | 6884.0 | OK | 114.78 | 1.73971 |
| Oregon | 74413 | 18.0 | 55.0 | 27.0 | 4217737 | 98386 | 42.8693 | 195269 | 2602.0 | OR | 46.2971 | 0.616919 |
| Pennsylvania | 70582 | 19.0 | 54.0 | 27.0 | 12801989 | 46058 | 277.954 | 1189708 | 26805.0 | PA | 92.9315 | 2.09382 |
| Rhode Island | 70151 | 22.0 | 53.0 | 25.0 | 1059361 | 1545 | 685.671 | 154869 | 2895.0 | RI | 146.191 | 2.73278 |
| South Carolina | 62028 | 16.0 | 53.0 | 31.0 | 5148714 | 32007 | 160.862 | 588110 | 9644.0 | SC | 114.225 | 1.87309 |
| South Dakota | 64255 | 17.0 | 57.0 | 26.0 | 884659 | 77121 | 11.4711 | 123707 | 1999.0 | SD | 139.836 | 2.25963 |
| Tennessee | 56627 | 17.0 | 54.0 | 30.0 | 6829174 | 42146 | 162.036 | 847027 | 12287.0 | TN | 124.031 | 1.79919 |
| Texas | 67444 | 18.0 | 53.0 | 29.0 | 28995881 | 268601 | 107.952 | 2933437 | 52075.0 | TX | 101.167 | 1.79594 |
| Utah | 84523 | 17.0 | 61.0 | 22.0 | 3205958 | 84904 | 37.7598 | 403576 | 2260.0 | UT | 125.883 | 0.704937 |
| Vermont | 74305 | 15.0 | 58.0 | 27.0 | 623989 | 9615 | 64.8975 | 23851 | 268.0 | VT | 38.2234 | 0.429495 |
| Virginia | 81313 | 25.0 | 51.0 | 24.0 | 8535519 | 42769 | 199.573 | 670082 | 11135.0 | VA | 78.5051 | 1.30455 |
| Washington | 82454 | 22.0 | 54.0 | 24.0 | 7614893 | 71303 | 106.796 | 425322 | 5685.0 | WA | 55.854 | 0.746563 |
| West Virginia | 53706 | 14.0 | 52.0 | 34.0 | 1792147 | 24231 | 73.9609 | 158247 | 2855.0 | WV | 88.3002 | 1.59306 |
| Wisconsin | 67355 | 20.0 | 58.0 | 23.0 | 5822434 | 65503 | 88.8881 | 674105 | 7733.0 | WI | 115.777 | 1.32814 |
| Wyoming | 65134 | 19.0 | 57.0 | 24.0 | 578759 | 97818 | 5.91669 | 59110 | 712.0 | WY | 102.132 | 1.23022 |
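The two per-1k columns are simple ratios: total cases (or deaths) divided by population, times 1,000. As a small worked sketch of that formula, the snippet below recomputes the values for Alabama and Hawaii from the numbers in the table. The frame name `rates` is only for illustration and is not part of the project code.

```python
import pandas as pd

# Two rows copied from the table above.
rates = pd.DataFrame(
    {
        "POPESTIMATE2019": [4903185, 1415872],
        "cases": [539848, 34084],
        "deaths": [11045.0, 490.0],
    },
    index=pd.Index(["Alabama", "Hawaii"], name="State"),
)

# Per-capita rates scaled to "per 1,000 residents", computed on whole columns.
rates["Cases per 1k Population"] = rates["cases"] / rates["POPESTIMATE2019"] * 1000
rates["Deaths per 1k Population"] = rates["deaths"] / rates["POPESTIMATE2019"] * 1000
print(rates.round(3))
```

Normalizing by population makes a small state such as Hawaii directly comparable with a large one, which raw totals do not allow.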
PREDICTIONS POPULATION VS TOTAL CASES:
In this scatter plot we take the total COVID-19 cases for each state and compare them to the population of that state. We predict that there is a positive correlation between total COVID-19 cases and the state's population, since the states with the largest populations have the most people who can get COVID-19. We expect some outliers that are far from the regression line, but most points should be close to it, with total cases increasing as the population increases.
plt.figure(figsize=(35,15))
sns.scatterplot(data=income_weath_distribution_popDensity_covid, x="POPESTIMATE2019", y="cases", hue="State",s=200)
plt.legend(bbox_to_anchor=(1.01, 1),borderaxespad=0)
plt.xlabel('Population', size = 20)
plt.ylabel('Total Cases x 1,000,000', size = 20)
plt.title('Population Vs Total Cases', size = 30)
# Label each point with its state abbreviation
for x, y, State in zip(income_weath_distribution_popDensity_covid['POPESTIMATE2019'], income_weath_distribution_popDensity_covid['cases'],income_weath_distribution_popDensity_covid['Abbreviation'] ):
    plt.text(x = x, y = y-150, s = State,color = 'black',fontsize=12)
# Fit a least-squares line (cases as a function of population) and draw it over the points
reg = LinearRegression().fit((income_weath_distribution_popDensity_covid['POPESTIMATE2019'].values).reshape(-1, 1), (income_weath_distribution_popDensity_covid['cases'].values).reshape(-1, 1))
x_values = income_weath_distribution_popDensity_covid['POPESTIMATE2019'].values
plt.plot(x_values,x_values*(reg.coef_[0].item())+reg.intercept_)
plt.show()
RESULTS OF POPULATION VS TOTAL CASES:
Our predictions were correct. There is a positive correlation between total cases and state population. California has the largest population, about 40 million, and the largest number of COVID-19 cases, about 3.5 million. Even so, California falls under the regression line: the expected number of COVID-19 cases for a state with a population of 40 million is about 3.7 million. Something surprising about this graph is that there are no drastic outliers. All of the points are close to the regression line.
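One way to put a number on "close to the regression line" is the Pearson correlation coefficient, together with the fitted slope and intercept. The sketch below uses only five rows copied from the combined table (California, Texas, Florida, New York, and Vermont), so its results are illustrative and will not match a fit over all states. It uses NumPy's `polyfit` as a stand-in for the scikit-learn fit in the plotting code.

```python
import numpy as np

# Population and total cases for CA, TX, FL, NY, and VT, copied from the table.
population = np.array([39512223, 28995881, 21477737, 19453561, 623989], dtype=float)
cases = np.array([3768240, 2933437, 2289926, 2081761, 23851], dtype=float)

# Pearson correlation: values near +1 mean the points hug an upward-sloping line.
r = np.corrcoef(population, cases)[0, 1]

# Least-squares line, cases = slope * population + intercept.
slope, intercept = np.polyfit(population, cases, deg=1)

# Cases the fitted line would predict for a state of California's size.
expected_ca = slope * population[0] + intercept
print(f"r = {r:.4f}, slope = {slope:.4f}, expected cases for CA = {expected_ca:,.0f}")
```

The slope can be read as the approximate number of reported cases per resident across this subset.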
PREDICTION OF TOTAL POPULATION VS TOTAL DEATHS:
We predict that the points will follow the same pattern as in the Population vs Total Cases graph. Earlier in the project, we showed the similarities between the cases over time graphs and the deaths over time graphs, so we predict that the population vs total deaths graph will also resemble the total cases graph. Since the Population vs Total Cases graph did not have outliers and cases and deaths are so closely related, we predict that this graph will have no outliers, or very few, and that most points will follow the regression line. We predict that there is a positive correlation between the population of a state and the total COVID deaths in that state.
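Whether a point counts as an outlier can be made concrete by measuring how far it sits from the fitted line. The sketch below flags points whose residual is more than two standard deviations from the line. The function name `flag_outliers`, the threshold of 2, and the toy data are all assumptions chosen for illustration and are not part of the original analysis.

```python
import numpy as np

def flag_outliers(x, y, labels, z_threshold=2.0):
    """Return the labels of points whose residual from a least-squares line
    is more than z_threshold residual standard deviations away."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    z_scores = residuals / residuals.std(ddof=1)
    return [label for label, z in zip(labels, z_scores) if abs(z) > z_threshold]

# Toy data: ten points on the line y = 2x, with the fifth one pushed far above it.
x = np.arange(1, 11)
y = 2.0 * x
y[4] += 30
labels = list("ABCDEFGHIJ")

print(flag_outliers(x, y, labels))
```

Applied to the state data, `x` would be the population column, `y` the deaths column, and `labels` the state abbreviations.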
plt.figure(figsize=(35,15))
sns.scatterplot(data=income_weath_distribution_popDensity_covid, x="POPESTIMATE2019", y="deaths", hue="State",s=200)
plt.legend(bbox_to_anchor=(1.01, 1),borderaxespad=0)
plt.xlabel('Population', size = 20)
plt.ylabel('Total Deaths', size = 20)
plt.title('Population Vs Total Deaths', size = 30)
for x, y, State in zip(income_weath_distribution_popDensity_covid['POPESTIMATE2019'], income_weath_distribution_popDensity_covid['deaths'],income_weath_distribution_popDensity_covid['Abbreviation'] ):
    plt.text(x = x, y = y-150, s = State,color = 'black',fontsize=12)
reg = LinearRegression().fit((income_weath_distribution_popDensity_covid['POPESTIMATE2019'].values).reshape(-1, 1), (income_weath_distribution_popDensity_covid['deaths'].values).reshape(-1, 1))
x_values = income_weath_distribution_popDensity_covid['POPESTIMATE2019'].values
plt.plot(x_values,x_values*(reg.coef_[0].item())+reg.intercept_)
plt.show()
| State | Median Income | Upper Class % | Middle Class % | Lower Class % | POPESTIMATE2019 | area (sq. mi) | Population Density | cases | deaths | Abbreviation | Cases per 1k Population | Deaths per 1k Population |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alabama | 56200 | 17.0 | 52.0 | 31.0 | 4903185 | 52423 | 93.5312 | 539848 | 11045.0 | AL | 110.101 | 2.25262 |
| Alaska | 78394 | 19.0 | 58.0 | 23.0 | 731545 | 656425 | 1.11444 | 69042 | 338.0 | AK | 94.3783 | 0.462036 |
| Arizona | 70674 | 17.0 | 54.0 | 30.0 | 7278717 | 114006 | 63.845 | 872542 | 17472.0 | AZ | 119.876 | 2.40042 |
| Arkansas | 54539 | 15.0 | 52.0 | 32.0 | 3017804 | 53182 | 56.7448 | 339255 | 5819.0 | AR | 112.418 | 1.92822 |
| California | 78105 | 19.0 | 49.0 | 31.0 | 39512223 | 163707 | 241.359 | 3768240 | 62677.0 | CA | 95.369 | 1.58627 |
| Colorado | 72499 | 22.0 | 55.0 | 23.0 | 5758736 | 104100 | 55.3193 | 534281 | 6593.0 | CO | 92.7775 | 1.14487 |
| Connecticut | 87291 | 27.0 | 50.0 | 24.0 | 3565287 | 5544 | 643.089 | 344977 | 8184.0 | CT | 96.76 | 2.29547 |
| Delaware | 74194 | 19.0 | 56.0 | 25.0 | 973764 | 1954 | 498.344 | 107145 | 1652.0 | DE | 110.032 | 1.69651 |
| District of Columbia | 93111 | 36.0 | 38.0 | 26.0 | 705749 | 68 | 10378.7 | 48530 | 1118.0 | DC | 68.7638 | 1.58413 |
| Florida | 58368 | 15.0 | 53.0 | 29.0 | 21477737 | 65758 | 326.618 | 2289926 | 36113.0 | FL | 106.619 | 1.68142 |
| Georgia | 56628 | 20.0 | 51.0 | 29.0 | 10617423 | 59441 | 178.621 | 1091755 | 19894.0 | GA | 102.827 | 1.87371 |
| Hawaii | 88006 | 18.0 | 58.0 | 24.0 | 1415872 | 10932 | 129.516 | 34084 | 490.0 | HI | 24.0728 | 0.346076 |
| Idaho | 65988 | 15.0 | 56.0 | 29.0 | 1787065 | 83574 | 21.383 | 190455 | 2074.0 | ID | 106.574 | 1.16056 |
| Illinois | 74399 | 22.0 | 52.0 | 26.0 | 12671821 | 57918 | 218.789 | 1380930 | 24772.0 | IL | 108.976 | 1.95489 |
| Indiana | 66693 | 18.0 | 57.0 | 26.0 | 6732219 | 36420 | 184.85 | 738066 | 13472.0 | IN | 109.632 | 2.00112 |
| Iowa | 66054 | 19.0 | 58.0 | 22.0 | 3155070 | 56276 | 56.0642 | 372075 | 6007.0 | IA | 117.929 | 1.90392 |
| Kansas | 73151 | 20.0 | 55.0 | 25.0 | 2913314 | 82282 | 35.4065 | 313867 | 6203.0 | KS | 107.735 | 2.12919 |
| Kentucky | 55662 | 16.0 | 53.0 | 31.0 | 4467673 | 40411 | 110.556 | 455234 | 6832.0 | KY | 101.895 | 1.52921 |
| Louisiana | 51707 | 18.0 | 48.0 | 34.0 | 4648794 | 51843 | 89.6706 | 471420 | 10713.0 | LA | 101.407 | 2.30447 |
| Maine | 66546 | 15.0 | 56.0 | 29.0 | 1344212 | 35387 | 37.986 | 65550 | 803.0 | ME | 48.7646 | 0.597376 |
| Maryland | 95572 | 27.0 | 53.0 | 21.0 | 6045680 | 12407 | 487.28 | 455862 | 9005.0 | MD | 75.4029 | 1.48949 |
| Massachusetts | 87707 | 26.0 | 51.0 | 23.0 | 6892503 | 10555 | 653.008 | 701490 | 17842.0 | MA | 101.776 | 2.58861 |
| Michigan | 64119 | 19.0 | 54.0 | 27.0 | 9986857 | 96810 | 103.159 | 984245 | 19812.0 | MI | 98.554 | 1.98381 |
| Minnesota | 81426 | 23.0 | 56.0 | 22.0 | 5639632 | 86943 | 64.8659 | 593932 | 7380.0 | MN | 105.314 | 1.3086 |
| Mississippi | 44787 | 14.0 | 51.0 | 36.0 | 2976149 | 48434 | 61.4475 | 315182 | 7260.0 | MS | 105.903 | 2.43939 |
| Missouri | 60597 | 19.0 | 55.0 | 26.0 | 6137428 | 69709 | 88.0436 | 610464 | 9401.0 | MO | 99.4658 | 1.53175 |
| Montana | 60195 | 16.0 | 57.0 | 27.0 | 1068778 | 147046 | 7.26832 | 110744 | 1603.0 | MT | 103.617 | 1.49984 |
| Nebraska | 73071 | 19.0 | 58.0 | 24.0 | 1934408 | 77358 | 25.0059 | 222496 | 2423.0 | NE | 115.02 | 1.25258 |
| Nevada | 70906 | 17.0 | 56.0 | 28.0 | 3080156 | 110567 | 27.8578 | 320808 | 5543.0 | NV | 104.153 | 1.79958 |
| New Hampshire | 86900 | 23.0 | 57.0 | 21.0 | 1359711 | 9351 | 145.408 | 98137 | 1336.0 | NH | 72.1749 | 0.982562 |
| New Jersey | 87726 | 24.0 | 51.0 | 24.0 | 8882190 | 8722 | 1018.37 | 1016172 | 26046.0 | NJ | 114.406 | 2.93238 |
| New Mexico | 53113 | 15.0 | 48.0 | 37.0 | 2096829 | 121593 | 17.2447 | 200863 | 4130.0 | NM | 95.7937 | 1.96964 |
| New York | 71855 | 19.0 | 49.0 | 32.0 | 19453561 | 54475 | 357.11 | 2081761 | 53548.0 | NY | 107.012 | 2.75261 |
| North Carolina | 61159 | 18.0 | 53.0 | 29.0 | 10488084 | 53821 | 194.87 | 997897 | 12891.0 | NC | 95.1458 | 1.22911 |
| North Dakota | 70031 | 24.0 | 53.0 | 21.0 | 762062 | 70704 | 10.7782 | 111406 | 1548.0 | ND | 146.19 | 2.03133 |
| Ohio | 64663 | 20.0 | 54.0 | 25.0 | 11689100 | 44828 | 260.754 | 1091578 | 19719.0 | OH | 93.3843 | 1.68696 |
| Oklahoma | 59397 | 17.0 | 54.0 | 20.0 | 3956971 | 69903 | 56.6066 | 454183 | 6884.0 | OK | 114.78 | 1.73971 |
| Oregon | 74413 | 18.0 | 55.0 | 27.0 | 4217737 | 98386 | 42.8693 | 195269 | 2602.0 | OR | 46.2971 | 0.616919 |
| Pennsylvania | 70582 | 19.0 | 54.0 | 27.0 | 12801989 | 46058 | 277.954 | 1189708 | 26805.0 | PA | 92.9315 | 2.09382 |
| Rhode Island | 70151 | 22.0 | 53.0 | 25.0 | 1059361 | 3515 | 301.383 | 154869 | 2895.0 | RI | 146.191 | 2.73278 |
| South Carolina | 62028 | 16.0 | 53.0 | 31.0 | 5148714 | 1545 | 3332.5 | 588110 | 9644.0 | SC | 114.225 | 1.87309 |
| South Dakota | 64255 | 17.0 | 57.0 | 26.0 | 884659 | 32007 | 27.6395 | 123707 | 1999.0 | SD | 139.836 | 2.25963 |
| Tennessee | 56627 | 17.0 | 54.0 | 30.0 | 6829174 | 77121 | 88.5514 | 847027 | 12287.0 | TN | 124.031 | 1.79919 |
| Texas | 67444 | 18.0 | 53.0 | 29.0 | 28995881 | 42146 | 687.987 | 2933437 | 52075.0 | TX | 101.167 | 1.79594 |
| Utah | 84523 | 17.0 | 61.0 | 22.0 | 3205958 | 268601 | 11.9358 | 403576 | 2260.0 | UT | 125.883 | 0.704937 |
| Vermont | 74305 | 15.0 | 58.0 | 27.0 | 623989 | 84904 | 7.34935 | 23851 | 268.0 | VT | 38.2234 | 0.429495 |
| Virginia | 81313 | 25.0 | 51.0 | 24.0 | 8535519 | 9615 | 887.729 | 670082 | 11135.0 | VA | 78.5051 | 1.30455 |
| Washington | 82454 | 22.0 | 54.0 | 24.0 | 7614893 | 42769 | 178.047 | 425322 | 5685.0 | WA | 55.854 | 0.746563 |
| West Virginia | 53706 | 14.0 | 52.0 | 34.0 | 1792147 | 71303 | 25.1342 | 158247 | 2855.0 | WV | 88.3002 | 1.59306 |
| Wisconsin | 67355 | 20.0 | 58.0 | 23.0 | 5822434 | 24231 | 240.289 | 674105 | 7733.0 | WI | 115.777 | 1.32814 |
| Wyoming | 65134 | 19.0 | 57.0 | 24.0 | 578759 | 65503 | 8.83561 | 59110 | 712.0 | WY | 102.132 | 1.23022 |
RESULTS OF TOTAL POPULATION VS TOTAL DEATHS:
Our general prediction that there would be a correlation between the two was correct. The points do not follow the regression line as closely as they do in the population vs cases graph. New York is the biggest outlier: it has a population of about 20 million people and had about 50,000 COVID-19 related deaths, while the regression line gives an expected number of about 35,000. Most points still follow the regression line, and there is a correlation between the total population of a state and its total COVID-19 deaths.
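Outliers such as New York can be ranked instead of read off the plot by computing each state's residual, the observed deaths minus the value on the fitted line. The sketch below is a hedged illustration: it fits a line with `np.polyfit` (swapped in for `LinearRegression` to keep the example small) on eight rows copied from the table above, so the residual sizes differ from those of the full 51-row fit.

```python
import numpy as np

# Eight rows copied from the table above; a subset used only for illustration.
states = ["California", "Texas", "Florida", "New York",
          "Pennsylvania", "Illinois", "Ohio", "Georgia"]
population = np.array([39512223, 28995881, 21477737, 19453561,
                       12801989, 12671821, 11689100, 10617423], dtype=float)
deaths = np.array([62677, 52075, 36113, 53548,
                   26805, 24772, 19719, 19894], dtype=float)

# Least-squares line: deaths ~ slope * population + intercept
slope, intercept = np.polyfit(population, deaths, 1)

# Residual = observed deaths minus deaths predicted by the line
residuals = deaths - (slope * population + intercept)

# List states from furthest above the line to furthest below it
order = np.argsort(residuals)[::-1]
for idx in order:
    print(f"{states[idx]:<14}{residuals[idx]:>10.0f}")

largest = states[order[0]]
```

In this subset New York has the largest positive residual, which matches what the graph shows.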
PREDICTION POPULATION DENSITY VS CASES & DEATHS PER 1k POPULATION
We predict that the cases and deaths by population density graphs will be very similar, because trends in cases and deaths have been similar throughout the project. We predict that there will be more cases and deaths in the states with the greatest population density. This makes sense because metropolitan areas have more person-to-person interaction, especially now that quarantine restrictions have been lifted and the public is opening up again. So we think there will be a correlation between population density and the number of COVID-19 cases and deaths.
plt.figure(figsize=(35,15))
sns.scatterplot(data=income_weath_distribution_popDensity_covid, x="Population Density", y="Cases per 1k Population", hue="State",s=200)
plt.legend(bbox_to_anchor=(1.01, 1),borderaxespad=0)
plt.xlabel('Population Density (Total Population/Area)', size = 20)
plt.ylabel('Cases per 1k Population', size = 20)
plt.title('Population Density Vs Cases per 1k Population', size = 30)
for x, y, State in zip(income_weath_distribution_popDensity_covid['Population Density'], income_weath_distribution_popDensity_covid['Cases per 1k Population'],income_weath_distribution_popDensity_covid['Abbreviation'] ):
    plt.text(x = x, y = y-1, s = State,color = 'black',fontsize=12)
reg = LinearRegression().fit((income_weath_distribution_popDensity_covid['Population Density'].values).reshape(-1, 1), (income_weath_distribution_popDensity_covid['Cases per 1k Population'].values).reshape(-1, 1))
x_values = income_weath_distribution_popDensity_covid['Population Density'].values
plt.plot(x_values,x_values*(reg.coef_[0].item())+reg.intercept_)
plt.show()
plt.figure(figsize=(35,15))
sns.scatterplot(data=income_weath_distribution_popDensity_covid, x="Population Density", y="Deaths per 1k Population",hue="State",s=200)
plt.legend(bbox_to_anchor=(1.01, 1),borderaxespad=0)
plt.xlabel('Population Density (Total Population/Area)', size = 20)
plt.ylabel('Deaths per 1k Population', size = 20)
plt.title('Population Density Vs Deaths per 1k Population', size = 30)
for x, y, State in zip(income_weath_distribution_popDensity_covid['Population Density'], income_weath_distribution_popDensity_covid['Deaths per 1k Population'],income_weath_distribution_popDensity_covid['Abbreviation'] ):
    plt.text(x = x, y = y, s = State,color = 'black',fontsize=12)
reg = LinearRegression().fit((income_weath_distribution_popDensity_covid['Population Density'].values).reshape(-1, 1), (income_weath_distribution_popDensity_covid['Deaths per 1k Population'].values).reshape(-1, 1))
x_values = income_weath_distribution_popDensity_covid['Population Density'].values
plt.plot(x_values,x_values*(reg.coef_[0].item())+reg.intercept_)
plt.show()
RESULTS POPULATION DENSITY VS CASES & DEATHS PER 1k POPULATION
The graphs do not show a correlation between population density and COVID-19 cases or deaths. DC, the most densely populated area, is around the median for cases and deaths compared to the other areas. This comparison would probably have worked better with US cities: Texas has densely populated areas but also a lot of open land, and a state-level graph doesn’t show that.
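Some rows in the area column of the table above look shifted by one position starting at Rhode Island. South Carolina, for instance, is listed at 1,545 sq. mi, which yields a density of 3,332.5. One possible cause (an assumption, since the cell that builds the column is not shown in this section) is that the area file carries an extra row such as Puerto Rico and the columns were combined by row position. A minimal sketch of joining by state name instead, using a few illustrative rows:

```python
import pandas as pd

# Populations copied from the table above.
populations = pd.DataFrame({
    "NAME": ["Rhode Island", "South Carolina", "South Dakota"],
    "POPESTIMATE2019": [1059361, 5148714, 884659],
})

# Illustrative area rows, including a territory that has no population row.
# The pairing of values to states here is an assumption made for the example.
areas = pd.DataFrame({
    "state": ["Puerto Rico", "Rhode Island", "South Carolina", "South Dakota"],
    "area (sq. mi)": [3515, 1545, 32007, 77121],
})

# Join on the state name so an extra row cannot shift the values;
# validate raises if a name appears more than once on either side.
merged = populations.merge(
    areas, left_on="NAME", right_on="state", how="left", validate="one_to_one"
)
merged["Population Density"] = merged["POPESTIMATE2019"] / merged["area (sq. mi)"]
print(merged[["NAME", "area (sq. mi)", "Population Density"]])
```

Because the join is keyed on the name, the territory row is simply left out and each state keeps its own area.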
PREDICTION MEDIAN INCOME VS CASES PER 1K POPULATION
We expect there to be a correlation between a state’s median income and the COVID-19 cases in that state. We determined that population is a major contributing factor to the number of COVID-19 cases, so we would expect states with a larger lower-class population to have more COVID-19 cases. These people are more likely to be essential workers who did not have the privilege of working from home, so they would be exposed to the virus more. States with a higher median income would have less of their population working in essential roles, so there would be fewer cases.
plt.figure(figsize=(35,15))
sns.scatterplot(data=income_weath_distribution_popDensity_covid, x="Median Income", y="Cases per 1k Population", hue="State",s=200)
plt.legend(bbox_to_anchor=(1.01, 1),borderaxespad=0)
plt.xlabel('Median Income in Dollars', size = 20)
plt.ylabel('Cases per 1k Population', size = 20)
plt.title('Median Income Vs Cases per 1k Population', size = 30)
for x, y, State in zip(income_weath_distribution_popDensity_covid['Median Income'], income_weath_distribution_popDensity_covid['Cases per 1k Population'],income_weath_distribution_popDensity_covid['Abbreviation']):
    plt.text(x = x+1, y = y-1, s = State,color = 'black',fontsize=12)
reg = LinearRegression().fit((income_weath_distribution_popDensity_covid['Median Income'].values).reshape(-1, 1), (income_weath_distribution_popDensity_covid['Cases per 1k Population'].values).reshape(-1, 1))
x_values = income_weath_distribution_popDensity_covid['Median Income'].values
plt.plot(x_values,x_values*(reg.coef_[0].item())+reg.intercept_)
plt.show()
RESULTS MEDIAN INCOME VS CASES PER 1K POPULATION
For this graph, we originally looked at Median Income vs Total Cases, but we realized that the graph was giving meaningless information because the states with the largest populations were going to have the most cases regardless of median income. To fix this, we changed total cases to cases per 1,000 population to normalize the data.
Our predictions were correct, and there seems to be a correlation between median income and cases per 1,000. There are several outliers, but the regression line has a negative slope, which means that as a state’s median income increases, the number of reported cases per 1,000 decreases. A notable outlier is Hawaii (HI): it has about 24 cases per 1,000 people and one of the largest median incomes. Hawaii is isolated from the mainland United States, and there are strict regulations for tourists traveling there because of COVID-19, so it’s no wonder that the cases are low.
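The normalization described above is a single division. As a small worked example, the two rows below are copied from the table (California and Hawaii), and the computed values can be checked against its "Cases per 1k Population" column:

```python
import pandas as pd

# Two rows copied from the table above.
df = pd.DataFrame(
    {"POPESTIMATE2019": [39512223, 1415872], "cases": [3768240, 34084]},
    index=pd.Index(["California", "Hawaii"], name="State"),
)

# Cases per 1,000 residents = total cases / population * 1000
df["Cases per 1k Population"] = df["cases"] / df["POPESTIMATE2019"] * 1000
print(df["Cases per 1k Population"].round(3))
```

California has roughly 110 times as many total cases as Hawaii, but only about 4 times as many per 1,000 residents, which is why the per-capita figure is the fairer comparison against median income.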
PREDICTION MEDIAN INCOME VS DEATHS PER 1k POPULATION
We predict that the points will line up almost the same as in median income vs cases per 1,000 population. There might be even more of a correlation between income and deaths because income can directly affect the quality of treatment available to individuals suffering from COVID-19: people from high median income areas would receive higher-quality treatment, so there would be fewer deaths in those states. So, there should be a correlation between median income and deaths per 1,000 population.
plt.figure(figsize=(35,15))
sns.scatterplot(data=income_weath_distribution_popDensity_covid, x="Median Income", y="Deaths per 1k Population", hue="State",s=200)
plt.legend(bbox_to_anchor=(1.01, 1),borderaxespad=0)
plt.xlabel('Median Income in Dollars', size = 20)
plt.ylabel('Deaths per 1k Population', size = 20)
plt.title('Median Income Vs Deaths per 1k Population', size = 30)
for x, y, State in zip(income_weath_distribution_popDensity_covid['Median Income'], income_weath_distribution_popDensity_covid['Deaths per 1k Population'],income_weath_distribution_popDensity_covid['Abbreviation'] ):
    plt.text(x = x+.01, y = y-.01, s = State,color = 'black',fontsize=12)
reg = LinearRegression().fit((income_weath_distribution_popDensity_covid['Median Income'].values).reshape(-1, 1), (income_weath_distribution_popDensity_covid['Deaths per 1k Population'].values).reshape(-1, 1))
x_values = income_weath_distribution_popDensity_covid['Median Income'].values
plt.plot(x_values,x_values*(reg.coef_[0].item())+reg.intercept_)
plt.show()
RESULTS MEDIAN INCOME VS DEATHS PER 1k POPULATION
This graph resembles the median income vs cases per 1,000 population graph. Median income and deaths per 1,000 population do correlate, and the regression line again has a negative slope. However, many states sit far from the regression line, so their death rates differ considerably from what the line predicts.
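How far the points sit from a regression line can be summarized with the coefficient of determination, R². The helper below is a sketch with a hypothetical name (`r_squared`) and made-up numbers, not results from this project. For a fitted scikit-learn `LinearRegression`, `reg.score(X, y)` returns the same quantity.

```python
import numpy as np

def r_squared(x, y):
    """R^2 of the least-squares line of y on x (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    residual = y - (slope * x + intercept)
    ss_res = np.sum(residual ** 2)        # variation left unexplained by the line
    ss_tot = np.sum((y - y.mean()) ** 2)  # total variation in y
    return 1.0 - ss_res / ss_tot

# Made-up data: one set hugging a line, one set scattered around it.
tight = r_squared([1, 2, 3, 4, 5], [2.1, 3.9, 6.0, 8.1, 9.9])
loose = r_squared([1, 2, 3, 4, 5], [5.0, 1.0, 6.0, 2.0, 7.0])
print(f"tight fit R^2 = {tight:.3f}, loose fit R^2 = {loose:.3f}")
```

An R² close to 1 means the line explains most of the variation, while a value near 0 means the slope exists but individual states are poorly predicted by it.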
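Next we are going to analyze how the wealth distribution of each state (the share of upper, middle, and lower class residents) relates to its cases and deaths per 1,000 population.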
plt.figure(figsize=(35,15))
sns.scatterplot(data=income_weath_distribution_popDensity_covid, x="Upper Class %", y="Cases per 1k Population", hue="State",s=200)
plt.legend(bbox_to_anchor=(1.01, 1),borderaxespad=0)
plt.xlabel('Upper Class %', size = 20)
plt.ylabel('Cases per 1k Population', size = 20)
plt.title('Upper Class % Vs Cases per 1k Population', size = 30)
for x, y, State in zip(income_weath_distribution_popDensity_covid['Upper Class %'], income_weath_distribution_popDensity_covid['Cases per 1k Population'],income_weath_distribution_popDensity_covid['Abbreviation']):
    plt.text(x = x, y = y-1, s = State,color = 'black',fontsize=12)
reg = LinearRegression().fit((income_weath_distribution_popDensity_covid['Upper Class %'].values).reshape(-1, 1), (income_weath_distribution_popDensity_covid['Cases per 1k Population'].values).reshape(-1, 1))
x_values = income_weath_distribution_popDensity_covid['Upper Class %'].values
plt.plot(x_values,x_values*(reg.coef_[0].item())+reg.intercept_)
plt.show()
plt.figure(figsize=(35,15))
sns.scatterplot(data=income_weath_distribution_popDensity_covid, x="Upper Class %", y="Deaths per 1k Population", hue="State",s=200)
plt.legend(bbox_to_anchor=(1.01, 1),borderaxespad=0)
plt.xlabel('Upper Class %', size = 20)
plt.ylabel('Deaths per 1k Population', size = 20)
plt.title('Upper Class % Vs Deaths per 1k Population', size = 30)
for x, y, State in zip(income_weath_distribution_popDensity_covid['Upper Class %'], income_weath_distribution_popDensity_covid['Deaths per 1k Population'],income_weath_distribution_popDensity_covid['Abbreviation']):
    plt.text(x = x, y = y, s = State,color = 'black',fontsize=12)
reg = LinearRegression().fit((income_weath_distribution_popDensity_covid['Upper Class %'].values).reshape(-1, 1), (income_weath_distribution_popDensity_covid['Deaths per 1k Population'].values).reshape(-1, 1))
x_values = income_weath_distribution_popDensity_covid['Upper Class %'].values
plt.plot(x_values,x_values*(reg.coef_[0].item())+reg.intercept_)
plt.show()
plt.figure(figsize=(35,15))
sns.scatterplot(data=income_weath_distribution_popDensity_covid, x="Middle Class %", y="Cases per 1k Population", hue="State",s=200)
plt.legend(bbox_to_anchor=(1.01, 1),borderaxespad=0)
plt.xlabel('Middle Class %', size = 20)
plt.ylabel('Cases per 1k Population', size = 20)
plt.title('Middle Class % Vs Cases per 1k Population', size = 30)
for x, y, State in zip(income_weath_distribution_popDensity_covid['Middle Class %'], income_weath_distribution_popDensity_covid['Cases per 1k Population'],income_weath_distribution_popDensity_covid['Abbreviation']):
    plt.text(x = x, y = y, s = State,color = 'black',fontsize=12)
reg = LinearRegression().fit((income_weath_distribution_popDensity_covid['Middle Class %'].values).reshape(-1, 1), (income_weath_distribution_popDensity_covid['Cases per 1k Population'].values).reshape(-1, 1))
x_values = income_weath_distribution_popDensity_covid['Middle Class %'].values
plt.plot(x_values,x_values*(reg.coef_[0].item())+reg.intercept_)
plt.show()
plt.figure(figsize=(35,15))
sns.scatterplot(data=income_weath_distribution_popDensity_covid, x="Middle Class %", y="Deaths per 1k Population", hue="State",s=200)
plt.legend(bbox_to_anchor=(1.01, 1),borderaxespad=0)
plt.xlabel('Middle Class %', size = 20)
plt.ylabel('Deaths per 1k Population', size = 20)
plt.title('Middle Class % Vs Deaths per 1k Population', size = 30)
for x, y, State in zip(income_weath_distribution_popDensity_covid['Middle Class %'], income_weath_distribution_popDensity_covid['Deaths per 1k Population'],income_weath_distribution_popDensity_covid['Abbreviation']):
    plt.text(x = x, y = y, s = State,color = 'black',fontsize=12)
reg = LinearRegression().fit((income_weath_distribution_popDensity_covid['Middle Class %'].values).reshape(-1, 1), (income_weath_distribution_popDensity_covid['Deaths per 1k Population'].values).reshape(-1, 1))
x_values = income_weath_distribution_popDensity_covid['Middle Class %'].values
plt.plot(x_values,x_values*(reg.coef_[0].item())+reg.intercept_)
plt.show()
plt.figure(figsize=(35,15))
sns.scatterplot(data=income_weath_distribution_popDensity_covid, x="Lower Class %", y="Cases per 1k Population", hue="State",s=200)
plt.legend(bbox_to_anchor=(1.01, 1),borderaxespad=0)
plt.xlabel('Lower Class %', size = 20)
plt.ylabel('Cases per 1k Population', size = 20)
plt.title('Lower Class % Vs Cases per 1k Population', size = 30)
for x, y, State in zip(income_weath_distribution_popDensity_covid['Lower Class %'], income_weath_distribution_popDensity_covid['Cases per 1k Population'],income_weath_distribution_popDensity_covid['Abbreviation']):
    plt.text(x = x, y = y, s = State,color = 'black',fontsize=12)
reg = LinearRegression().fit((income_weath_distribution_popDensity_covid['Lower Class %'].values).reshape(-1, 1), (income_weath_distribution_popDensity_covid['Cases per 1k Population'].values).reshape(-1, 1))
x_values = income_weath_distribution_popDensity_covid['Lower Class %'].values
plt.plot(x_values,x_values*(reg.coef_[0].item())+reg.intercept_)
plt.show()
plt.figure(figsize=(35,15))
sns.scatterplot(data=income_weath_distribution_popDensity_covid, x="Lower Class %", y="Deaths per 1k Population", hue="State",s=200)
plt.legend(bbox_to_anchor=(1.01, 1),borderaxespad=0)
plt.xlabel('Lower Class %', size = 20)
plt.ylabel('Deaths per 1k Population', size = 20)
plt.title('Lower Class % Vs Deaths per 1k Population', size = 30)
for x, y, State in zip(income_weath_distribution_popDensity_covid['Lower Class %'], income_weath_distribution_popDensity_covid['Deaths per 1k Population'],income_weath_distribution_popDensity_covid['Abbreviation']):
    plt.text(x = x, y = y, s = State,color = 'black',fontsize=12)
reg = LinearRegression().fit((income_weath_distribution_popDensity_covid['Lower Class %'].values).reshape(-1, 1), (income_weath_distribution_popDensity_covid['Deaths per 1k Population'].values).reshape(-1, 1))
x_values = income_weath_distribution_popDensity_covid['Lower Class %'].values
plt.plot(x_values,x_values*(reg.coef_[0].item())+reg.intercept_)
plt.show()
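The plotting cells above repeat the same steps for every pair of columns: scatter, label each point, fit a line, draw it. A possible refactor is sketched below. `scatter_with_fit` is a hypothetical helper, `np.polyfit` is swapped in for `LinearRegression`, and plain matplotlib replaces seaborn, so treat it as one option and not the notebook's method. The four demo rows are copied from the table (AL, AK, AZ, AR).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def scatter_with_fit(df, x_col, y_col, label_col="Abbreviation", ax=None):
    """Scatter y_col against x_col, label each point, and draw a fitted line.

    Axis labels and the title are taken from the column names, so they
    always describe the data that is actually plotted.
    """
    if ax is None:
        _, ax = plt.subplots(figsize=(12, 6))
    x = df[x_col].to_numpy(dtype=float)
    y = df[y_col].to_numpy(dtype=float)
    ax.scatter(x, y, s=80)
    for xi, yi, label in zip(x, y, df[label_col]):
        ax.text(xi, yi, label, color="black", fontsize=12)
    slope, intercept = np.polyfit(x, y, 1)  # least-squares line
    xs = np.sort(x)
    ax.plot(xs, slope * xs + intercept)
    ax.set_xlabel(x_col, size=20)
    ax.set_ylabel(y_col, size=20)
    ax.set_title(f"{x_col} Vs {y_col}", size=30)
    return ax, slope, intercept

# Four rows copied from the table above, for demonstration only.
demo = pd.DataFrame({
    "Lower Class %": [31.0, 23.0, 30.0, 32.0],
    "Deaths per 1k Population": [2.25262, 0.462036, 2.40042, 1.92822],
    "Abbreviation": ["AL", "AK", "AZ", "AR"],
})
ax, slope, intercept = scatter_with_fit(demo, "Lower Class %", "Deaths per 1k Population")
```

With a helper like this, each of the class-percentage plots becomes a single call, and a copied cell cannot end up with a label from a different column.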
# Collect the state names from every dataframe, in order of first appearance
states = []
for df in df_list:
    states.append(df['state'].tolist())
all_states = []
for state in states:
    for s in state:
        if s not in all_states:
            all_states.append(s)
i=0
cases_daily_increase = {}
death_daily_increase = {}
for df in df_list:
    x = df.groupby('state')
    for y in range(len(x)):
        state = all_states[i]
        # Skip territories, which are not in the state summary table
        if state !='Guam' and state !='Northern Mariana Islands' and state !='Puerto Rico' and state !='Virgin Islands':
            data = x.get_group(state).reset_index(drop=True)
            # Fit cumulative cases/deaths per 1k population against the day index;
            # the slope is the average daily increase per 1k population
            cases_reg = LinearRegression().fit(np.array(data.index).reshape(-1, 1),((data['cases'].values/income_weath_distribution_popDensity_covid['POPESTIMATE2019'][state])*1000).reshape(-1, 1))
            death_reg = LinearRegression().fit(np.array(data.index).reshape(-1, 1),((data['deaths'].values/income_weath_distribution_popDensity_covid['POPESTIMATE2019'][state])*1000).reshape(-1, 1))
            case_m = cases_reg.coef_[0]
            death_m = death_reg.coef_[0]
            cases_daily_increase[state] = case_m.item(0)
            death_daily_increase[state] = death_m.item(0)
        i+=1
# Sort both dictionaries alphabetically by state so they line up with the table rows
cases_daily_increase = dict( sorted(cases_daily_increase.items(), key=lambda x: x[0].lower()) )
death_daily_increase = dict( sorted(death_daily_increase.items(), key=lambda x: x[0].lower()) )
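The "daily increase" computed above is the slope of a straight line fitted to each state's cumulative count per 1,000 residents against the day index. The sketch below shows the same idea on synthetic data with made-up state names and populations, iterating over the groups by name instead of keeping a separate position counter. `per_1k_daily_slope` is a hypothetical helper, and `np.polyfit` stands in for `LinearRegression`.

```python
import numpy as np
import pandas as pd

# Synthetic cumulative counts for two made-up states (not project data).
daily = pd.DataFrame({
    "state": ["A"] * 5 + ["B"] * 5,
    "cases": [0, 10, 20, 30, 40, 0, 50, 100, 150, 200],
})
population = pd.Series({"A": 10_000, "B": 100_000})

def per_1k_daily_slope(group, pop):
    """Slope of cumulative cases per 1k residents against the day index."""
    days = np.arange(len(group))
    per_1k = group["cases"].to_numpy(dtype=float) / pop * 1000
    slope, _intercept = np.polyfit(days, per_1k, 1)
    return slope

# groupby yields (name, rows) pairs, so each slope is keyed by its own state
slopes = {
    state: per_1k_daily_slope(group, population[state])
    for state, group in daily.groupby("state")
}
print(slopes)
```

State A gains 10 cases a day in a population of 10,000, which is 1 case per 1,000 residents per day. State B gains 50 a day in 100,000, which is 0.5, so the smaller state has the faster per-capita growth even though its raw counts are lower.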
plt.figure(figsize=(35,15))
x_val = income_weath_distribution_popDensity_covid['Median Income'].values
y_val = cases_daily_increase.values()
plt.scatter(x=x_val, y=y_val,s=200)
plt.xlabel('Median Income in Dollars', size = 20)
plt.ylabel('Daily Increase in Cases', size = 20)
plt.title('Daily Increase in Cases Vs Median Income', size = 30)
for x, y, State in zip(x_val, y_val,income_weath_distribution_popDensity_covid['Abbreviation']):
    plt.text(x = x, y = y, s = State,color = 'black',fontsize=12)
reg = LinearRegression().fit((x_val).reshape(-1, 1), (np.array(list(y_val))).reshape(-1, 1))
x_values = x_val
plt.plot(x_values,x_values*(reg.coef_[0].item())+reg.intercept_)
plt.show()
plt.figure(figsize=(35,15))
x_val = income_weath_distribution_popDensity_covid['Upper Class %'].values
y_val = death_daily_increase.values()
plt.scatter(x=x_val, y=y_val,s=200)
plt.xlabel('Upper Class %', size = 20)
plt.ylabel('Daily Increase in Deaths', size = 20)
plt.title('Upper Class % vs Daily Increase in Deaths', size = 30)
for x, y, State in zip(x_val, y_val,income_weath_distribution_popDensity_covid['Abbreviation']):
    plt.text(x = x, y = y, s = State,color = 'black',fontsize=12)
reg = LinearRegression().fit((x_val).reshape(-1, 1), (np.array(list(y_val))).reshape(-1, 1))
x_values = x_val
plt.plot(x_values,x_values*(reg.coef_[0].item())+reg.intercept_)
plt.show()